model-parallel training
Reviews: Ouroboros: On Accelerating Training of Transformer-Based Language Models
The paper introduces a new method for model-parallel training, where the layers of a model are distributed across multiple accelerators. The method avoids locking in the backward pass by using stale gradients during back-propagation. I'm not aware of any prior work that has taken this approach. Furthermore, the authors provide theoretical analysis and empirical results to demonstrate that, despite using stale gradients, their method has convergence properties similar to conventional SGD. The lack of effective model-parallel training is a major roadblock to scaling up model sizes, and the proposed approach promises to overcome this issue.
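The core idea can be illustrated with a toy experiment (my own sketch, not the paper's implementation): run gradient descent on a small least-squares problem, but let each update use a gradient computed from the parameters as they were a few steps earlier, which is the kind of staleness a partition sees when it does not wait for the full backward pass. The problem, the `delay` value, and the learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))      # toy least-squares problem: minimize ||Ax - b||^2
b = rng.normal(size=8)

def grad(x):
    return 2.0 * A.T @ (A @ x - b)

delay = 3                         # staleness, loosely analogous to the pipeline depth
lr = 0.005
x = np.zeros(4)
history = [x.copy()]              # past iterates, needed to form stale gradients

for t in range(400):
    stale_x = history[max(0, t - delay)]   # parameters from `delay` steps ago
    x = x - lr * grad(stale_x)             # update with the stale gradient
    history.append(x.copy())

print("final loss:", np.sum((A @ x - b) ** 2))
```

For a small enough step size the delayed updates still drive the loss down, which matches the intuition behind the paper's claim that staleness need not hurt convergence.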
A guide to the field of Deep Learning
Since the list has gotten rather long, I have included an excerpt above; the full list is at the bottom of this post. At the entry level, the datasets are small and often fit easily into main memory. If they don't already come pre-processed, applying the necessary pre-processing takes only a few lines of code (a minimal example follows below). You'll mainly work with the major domains: audio, image, time-series, and text. Before diving into the broad field of Deep Learning, it's a good idea to study these basic techniques first.
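As a purely illustrative example of such a "few lines" of pre-processing, here is how a small in-memory image dataset might be scaled and one-hot encoded with Keras; the choice of MNIST and the exact transforms are my assumptions, not something prescribed by the guide.

```python
from tensorflow import keras

# Load a small dataset that fits comfortably in main memory.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale pixel values to [0, 1] and add a channel dimension.
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

# One-hot encode the labels.
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)
```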
HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
Ammar Ahmad Awan, Arpan Jain, Quentin Anthony, Hari Subramoni, Dhabaleswar K. Panda
The enormous amount of data and computation required to train DNNs has led to the rise of various parallelization strategies. Broadly, there are two strategies: 1) Data-Parallelism -- replicating the DNN on multiple processes and training on different training samples, and 2) Model-Parallelism -- dividing elements of the DNN itself into partitions across different processes. While data-parallelism has been extensively studied and developed, model-parallelism has received less attention, as it is non-trivial to split the model across processes. In this paper, we propose HyPar-Flow: a framework for scalable and user-transparent parallel training of very large DNNs (up to 5,000 layers). We exploit TensorFlow's Eager Execution features and Keras APIs for model definition and distribution. HyPar-Flow exposes a simple API to offer data, model, and hybrid (model + data) parallel training for models defined using the Keras API. Under the hood, we introduce MPI communication primitives like send and recv on layer boundaries for data exchange between model-partitions and allreduce for gradient exchange across model-replicas. Our proposed designs in HyPar-Flow offer up to 3.1x speedup over sequential training for ResNet-110 and up to 1.6x speedup over Horovod-based data-parallel training for ResNet-1001, a model that has 1,001 layers and 30 million parameters. We provide an in-depth performance characterization of the HyPar-Flow framework on multiple HPC systems with diverse CPU architectures, including Intel Xeon(s) and AMD EPYC. HyPar-Flow provides a 110x speedup on 128 nodes of the Stampede2 cluster at TACC for hybrid-parallel training of ResNet-1001.
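The communication pattern the abstract describes can be sketched with mpi4py (this is my own illustration of the pattern, not HyPar-Flow's actual API): adjacent model-partitions exchange activations and gradients with point-to-point send/recv at layer boundaries, and an allreduce averages gradients across data-parallel replicas in a hybrid run.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()   # one model-partition per rank

x = np.random.rand(32, 64).astype(np.float32)    # dummy activations

# Forward pass: receive activations from the previous partition, "compute"
# this partition's layers (identity here), and send results downstream.
if rank > 0:
    comm.Recv(x, source=rank - 1)
activations = x                                  # placeholder for real layer computation
if rank < size - 1:
    comm.Send(activations, dest=rank + 1)

# Backward pass: gradients flow in the opposite direction.
grad = np.ones_like(x)
if rank < size - 1:
    comm.Recv(grad, source=rank + 1)
if rank > 0:
    comm.Send(grad, dest=rank - 1)

# Data-parallel part of a hybrid setup: average gradients across replicas.
# A real hybrid run would use a sub-communicator spanning the replicas that
# hold the same partition; COMM_WORLD is used here only to keep the sketch short.
local_grad = np.random.rand(128).astype(np.float32)
avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= size
```

Run with, e.g., `mpirun -np 4 python sketch.py`, where each rank stands in for one model-partition.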
Local Critic Training for Model-Parallel Learning of Deep Neural Networks
This paper proposes a novel approach to training deep neural networks in a parallelized manner by unlocking the layer-wise dependency of backpropagation. The approach employs additional modules, called local critic networks, alongside the main network to be trained; these estimate the output of the main network in order to obtain error gradients without complete feedforward and backward propagation. We propose a cascaded learning strategy for these local networks so that different layer groups can be trained in parallel. Experimental results show the effectiveness of the proposed approach and suggest guidelines for determining appropriate algorithm parameters. In addition, we demonstrate that the approach can also be used for structural optimization of neural networks, computationally efficient progressive inference, and ensemble classification for performance improvement.
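A rough PyTorch sketch of the idea (the layer split, critic architecture, and losses are my assumptions, not the paper's code): a small critic attached to the first layer group predicts the final output, the lower group updates from the critic's loss without waiting for the rest of the network, and the critic is separately trained to mimic the main network's actual output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# First layer group, remaining layers, and a small local critic at the split point.
lower  = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
upper  = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
critic = nn.Linear(256, 10)

opt_lower  = torch.optim.SGD(lower.parameters(), lr=0.01)
opt_upper  = torch.optim.SGD(upper.parameters(), lr=0.01)
opt_critic = torch.optim.SGD(critic.parameters(), lr=0.01)

x = torch.randn(64, 1, 28, 28)             # dummy batch
y = torch.randint(0, 10, (64,))

h = lower(x)

# (1) Update the lower group from the critic's estimated output,
#     without waiting for the full forward/backward pass.
loss_lower = F.cross_entropy(critic(h), y)
opt_lower.zero_grad()
loss_lower.backward()
opt_lower.step()

# (2) Update the upper group on detached activations, decoupled from (1).
out = upper(h.detach())
loss_upper = F.cross_entropy(out, y)
opt_upper.zero_grad()
loss_upper.backward()
opt_upper.step()

# (3) Train the critic to approximate the main network's output.
opt_critic.zero_grad()
loss_critic = F.mse_loss(critic(h.detach()), out.detach())
loss_critic.backward()
opt_critic.step()
```

Because steps (1) and (2) only share the detached activations, the two layer groups could in principle run their updates on different devices, which is the decoupling the paper exploits.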